image.png

Table of Contents
1 Data Cleaning
   1.1 Data Cleaning Strategies
   1.2 Result
2 Data Exploration
   2.1 Overview
   2.2 Response Variable
   2.3 Numerical Variables
   2.4 Charater Variables
   2.4 Correlation Check
3 Data Processing
   3.1 One-Hot Encoding
   3.2 Splitting Dataset into Training and Test Set
   3.3 Feature Scaling
4 Data Modeling
   4.1 CatBoost
         4.11 Model Building
         4.12 K-Fold Cross Validation
   4.2 Random Forest
         4.21 Grid Search
         4.22 Model Building
         4.23 K-Fold Cross Validation
   4.3 XGBoost
         4.31 Grid Search
         4.32 Model Building
         4.33 K-Fold Cross Validation
   4.4 Logistic Regression
         4.41-4.44 1st~4th Try
         4.45 5th Try - Best Model
         4.46 Model Implications
5 Lead Scoring System

1. Data Cleaning

I used R to clean the data. Please refer to R file if you would like to see more details.

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

The unprocessed raw data looks like below:

In [6]:
uncleaned = pd.read_csv('Data for Case Study Data Science Online Courses.csv')
In [7]:
def overview(uncleaned):
    overview = pd.DataFrame(uncleaned.dtypes, columns=['Data Types'])
    overview = overview.reset_index()
    overview['Unique Values'] = uncleaned.nunique().values
    overview['Missing Values'] = uncleaned.isnull().sum().values    
 
    return overview

overview(uncleaned)
Out[7]:
index Data Types Unique Values Missing Values
0 Prospect ID object 9240 0
1 Lead Number int64 9240 0
2 Lead Origin object 5 0
3 Lead Source object 21 36
4 Do Not Email object 2 0
5 Do Not Call object 2 0
6 Converted int64 2 0
7 TotalVisits float64 41 137
8 Total Time Spent on Website int64 1731 0
9 Page Views Per Visit float64 114 137
10 Last Activity object 17 103
11 Country object 38 2461
12 Specialization object 19 1438
13 How did you hear about Data Science Online Cou... object 10 2207
14 What is your current occupation object 6 2690
15 What matters most to you in choosing a course object 3 2709
16 Search object 2 0
17 Magazine object 1 0
18 Newspaper Article object 2 0
19 Data Science Online Courses Forums object 2 0
20 Newspaper object 2 0
21 Digital Advertisement object 2 0
22 Through Recommendations object 2 0
23 Receive More Updates About Our Courses object 1 0
24 Tags object 26 3353
25 Lead Quality object 5 4767
26 Update me on Data Science Content object 1 0
27 Get updates on DM Content object 1 0
28 Lead Profile object 6 2709
29 City object 7 1420
30 Asymmetrique Activity Index object 3 4218
31 Asymmetrique Profile Index object 3 4218
32 Asymmetrique Activity Score float64 12 4218
33 Asymmetrique Profile Score float64 10 4218
34 I agree to pay the amount through check or cre... object 1 0
35 A free copy of Mastering The Case Study object 2 0
36 Last Notable Activity object 16 0

1.1 Data Cleaning Strategies

Characters:

  • Put the new label for labels with few results so as to avoid data shift between train and test data.
  • Replace NA values with "Others".
  • Replace "Select" with "Others".
  • Convert variable to factor.

Numerical Variables:

  • Replace NA values with mean variable
  • Remove outliers and fill them with 95% quantile value.

Other Strategy:

  • Change the long variable name into short one to enhance readability

1.2 Result

In [180]:
data = pd.read_csv('Cleaned_Data Science Online Courses.csv', index_col = 0)
In [182]:
overview(data)
Out[182]:
index Data Types Unique Values Missing Values
0 Lead Origin object 4 0
1 Lead Source object 22 0
2 Do Not Email int64 2 0
3 Do Not Call int64 2 0
4 Converted int64 2 0
5 TotalVisits float64 12 0
6 TotalTime int64 1463 0
7 PPV float64 92 0
8 Last Activity object 18 0
9 Country object 39 0
10 Specialization object 19 0
11 HowHear object 9 0
12 CurrentOccupation object 7 0
13 WhatMatters object 4 0
14 Search int64 2 0
15 Magazine int64 1 0
16 Newspaper Article int64 2 0
17 DSForums int64 2 0
18 Newspaper int64 2 0
19 Digital Advertisement int64 2 0
20 Through Recommendations int64 2 0
21 ReceiveUpdates int64 1 0
22 Tags object 27 0
23 Lead Quality object 6 0
24 UpdateDScontent int64 1 0
25 UpdateDMcontent int64 1 0
26 Lead Profile object 7 0
27 City object 6 0
28 Asymmetrique Activity Index object 4 0
29 Asymmetrique Profile Index object 4 0
30 Asymmetrique Activity Score float64 13 0
31 Asymmetrique Profile Score float64 11 0
32 PayThrough int64 1 0
33 FreeCopy int64 2 0
34 LastNotableActivity object 16 0

We can see that there is no missing value after cleaning the data and all of the column names look good.

2. Data Exploration

Business insights can be found in the report slides.

Strategies:

  • Create a crosstab of each feature and Converted
  • Reduce redundancy by merging similar categories

2.1 Overview

In [154]:
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9240 entries, 1 to 9240
Data columns (total 35 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Lead Origin                  9240 non-null   object 
 1   Lead Source                  9240 non-null   object 
 2   Do Not Email                 9240 non-null   int64  
 3   Do Not Call                  9240 non-null   int64  
 4   Converted                    9240 non-null   int64  
 5   TotalVisits                  9240 non-null   float64
 6   TotalTime                    9240 non-null   int64  
 7   PPV                          9240 non-null   float64
 8   Last Activity                9240 non-null   object 
 9   Country                      9240 non-null   object 
 10  Specialization               9240 non-null   object 
 11  HowHear                      9240 non-null   object 
 12  CurrentOccupation            9240 non-null   object 
 13  WhatMatters                  9240 non-null   object 
 14  Search                       9240 non-null   int64  
 15  Magazine                     9240 non-null   int64  
 16  Newspaper Article            9240 non-null   int64  
 17  DSForums                     9240 non-null   int64  
 18  Newspaper                    9240 non-null   int64  
 19  Digital Advertisement        9240 non-null   int64  
 20  Through Recommendations      9240 non-null   int64  
 21  ReceiveUpdates               9240 non-null   int64  
 22  Tags                         9240 non-null   object 
 23  Lead Quality                 9240 non-null   object 
 24  UpdateDScontent              9240 non-null   int64  
 25  UpdateDMcontent              9240 non-null   int64  
 26  Lead Profile                 9240 non-null   object 
 27  City                         9240 non-null   object 
 28  Asymmetrique Activity Index  9240 non-null   object 
 29  Asymmetrique Profile Index   9240 non-null   object 
 30  Asymmetrique Activity Score  9240 non-null   float64
 31  Asymmetrique Profile Score   9240 non-null   float64
 32  PayThrough                   9240 non-null   int64  
 33  FreeCopy                     9240 non-null   int64  
 34  LastNotableActivity          9240 non-null   object 
dtypes: float64(4), int64(16), object(15)
memory usage: 2.5+ MB
In [156]:
data.describe()
Out[156]:
Do Not Email Do Not Call Converted TotalVisits TotalTime PPV Search Magazine Newspaper Article DSForums Newspaper Digital Advertisement Through Recommendations ReceiveUpdates UpdateDScontent UpdateDMcontent Asymmetrique Activity Score Asymmetrique Profile Score PayThrough FreeCopy
count 9240.00 9240.00 9240.00 9240.00 9240.00 9240.00 9240.00 9240.00 9240.00 9240.00 9240.00 9240.00 9240.00 9240.00 9240.00 9240.00 9240.00 9240.00 9240.00 9240.00
mean 0.08 0.00 0.39 3.19 479.24 2.26 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 14.31 16.34 0.00 0.31
std 0.27 0.01 0.49 2.76 528.82 1.78 0.04 0.00 0.01 0.01 0.01 0.02 0.03 0.00 0.00 0.00 1.02 1.34 0.00 0.46
min 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 7.00 11.00 0.00 0.00
25% 0.00 0.00 0.00 1.00 12.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 14.00 16.00 0.00 0.00
50% 0.00 0.00 0.00 3.00 248.00 2.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 14.31 16.34 0.00 0.00
75% 0.00 0.00 1.00 5.00 936.00 3.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 14.31 16.34 0.00 1.00
max 1.00 1.00 1.00 10.00 1562.00 6.00 1.00 0.00 1.00 1.00 1.00 1.00 1.00 0.00 0.00 0.00 18.00 20.00 0.00 1.00
In [157]:
data.hist(bins = 30, figsize = (20,20), color = '#1169D4')
Out[157]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f9fdf828fd0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f9fe2d45890>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f9fe2cdeb90>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f9fe2d94cd0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f9fe2b9c990>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f9fe2a26d10>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f9fe29ed9d0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f9fe2862210>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f9fe2862d50>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f9fe27e6710>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f9fe23caa50>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f9fe2159dd0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f9fe2006a90>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f9fe206ce10>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f9fe218ead0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f9fe2236e50>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f9fe2298b10>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f9fe2442e90>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f9fe24a6b50>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f9fe27feed0>]],
      dtype=object)

We can see the following variables have only one value, which means it is useless for further analysis and model building so we drop them.
'DSForums', 'Digital Advertisement', 'Do Not Call', 'Do Not Email', 'Magazine', 'Newspaper', 'Newspaper Article', 'PayThrough', 'ReceiveUpdates', 'Search', 'Through Recommendations', 'UpdateDScontent','UpdateDMcontent'

In [158]:
drop_list = ['DSForums', 'Digital Advertisement', 'Do Not Call', 'Do Not Email', 'Magazine', 'Newspaper', 
            'Newspaper Article', 'PayThrough', 'ReceiveUpdates', 'Search', 'Through Recommendations', 'UpdateDScontent',
            'UpdateDMcontent']
data = data.drop(columns = drop_list)

Asymmetrique Activity/Profile Index

Asymmetrique Activity/Profile Score

In [160]:
fig, axs = plt.subplots(2,2, figsize = (15,10))
plt1 = sns.countplot(data['Asymmetrique Activity Index'], ax = axs[0,0])
plt2 = sns.boxplot(data['Asymmetrique Activity Score'], ax = axs[0,1])
plt3 = sns.countplot(data['Asymmetrique Profile Index'], ax = axs[1,0])
plt4 = sns.boxplot(data['Asymmetrique Profile Score'], ax = axs[1,1])
plt.tight_layout()

I decide to delete those values because:
1) There are too much variation
2) After Googling thoese index/score, I found these values are created by an Indian advertising agency but there is no documentation explaining their meaning and no further analysis can be made due to lack of background information.

In [183]:
drop_list2 = ['Asymmetrique Activity Index','Asymmetrique Activity Score',
                  'Asymmetrique Profile Index','Asymmetrique Profile Score']
data = data.drop(columns = drop_list2)

2.2 Response Variable

In [184]:
converted  = data[data['Converted'] == 1]
nonconverted = data[data['Converted'] == 0]

print("Total =", len(data))

print("Number of leads who converted =", len(converted))
print("Percentage of leads who converted =", 1.*len(converted)/len(data)*100.0, "%")
 
print("Number of leads who did not convert =", len(nonconverted))
print("Percentage of leads who did not convert =", 1.*len(nonconverted)/len(data)*100.0, "%")

sns.countplot(data['Converted']);
Total = 9240
Number of leads who converted = 3561
Percentage of leads who converted = 38.53896103896104 %
Number of leads who did not convert = 5679
Percentage of leads who did not convert = 61.46103896103896 %

2.3 Numerical Variables

Total Visits

In [164]:
# KDE describes the probability density at different values in a continuous variable. 

plt.figure(figsize=(12,7))

sns.kdeplot(converted['TotalVisits'], label = 'leads who converted', shade = True, color = 'r')
sns.kdeplot(nonconverted['TotalVisits'], label = 'leads who did not convert', shade = True, color = 'b')

plt.xlabel('TotalVisits') 
Out[164]:
Text(0.5, 0, 'TotalVisits')

When the number of Total visits is less than 3, there are more leads who did not convert than converted.

Total Time

In [165]:
plt.figure(figsize=(12,7))

sns.kdeplot(converted['TotalTime'], label = 'leads who converted', shade = True, color = 'r')
sns.kdeplot(nonconverted['TotalTime'], label = 'leads who did not convert', shade = True, color = 'b')

plt.xlabel('TotalTime') 
Out[165]:
Text(0.5, 0, 'TotalTime')

When Total Time is less than 600, there are more leads who did not convert. However, when it is more than 600, leads tend to convert.

Pages per visits

In [166]:
plt.figure(figsize=(12,7))

sns.kdeplot(converted['PPV'], label = 'leads who converted', shade = True, color = 'r')
sns.kdeplot(nonconverted['PPV'], label = 'leads who did not convert', shade = True, color = 'b')

plt.xlabel('TotalTime') 
Out[166]:
Text(0.5, 0, 'TotalTime')

There seems to be little difference between leads who converted and those who did not convert in various Total Time.

2.4 Character Variables

In [188]:
def count(df, v1, v2):
    ctr = df[[v1, v2]].groupby(v1, as_index=False).mean().sort_values(v2, ascending=False)
    count = df[[v1, v2]].groupby(v1, as_index=False).count().sort_values(v2, ascending=False)
    merge = count.merge(ctr, on=v1, how='left')
    merge.columns=[v1, 'Count', 'Converted Rate(%)']
    return merge

def crosstab (df, features, target):
    """Plot the bar char of count by converted"""
    for feature in features:
        pd.crosstab(df[feature],df[target]).plot(kind='barh', figsize=(13,8), stacked=True)
        plt.title('Number of '+ feature+' by '+ target)
        plt.xlabel('Count')
        plt.ylabel(feature)
        # Display the table obove each chart 
        return count(df, feature, target)  

Lead Origin

In [168]:
crosstab(data, ['Lead Origin'], 'Converted')
Out[168]:
Lead Origin Count Converted Rate(%)
0 Landing Page Submission 4886 0.36
1 API 3580 0.31
2 Lead Add Form 718 0.92
3 Lead Import&Quick Add Form 56 0.25

Lead Source

In [178]:
crosstab(data, ['Lead Source'], 'Converted')
Out[178]:
Lead Source Count Converted Rate(%)
0 Google 2873 0.40
1 Direct Traffic 2543 0.32
2 Olark Chat 1755 0.26
3 Organic Search 1154 0.38
4 Reference 534 0.92
5 Welingak Website 142 0.99
6 Referral Sites 125 0.25
7 Others 59 0.64
8 Facebook 55 0.24
In [170]:
# There are google and Google so we need to fix the spelling issue
data.loc[data['Lead Source'] == 'google', 'Lead Source'] = 'Google'

# Put the new label for labels with few results so as to avoid data shift between train 
# and test data.
data.loc[data['Lead Source'] == 'bing', 'Lead Source'] = 'Others'
data.loc[data['Lead Source'] == 'Click2call', 'Lead Source'] = 'Others'
data.loc[data['Lead Source'] == 'Press_Release', 'Lead Source'] = 'Others'
data.loc[data['Lead Source'] == 'Social Media', 'Lead Source'] = 'Others'
data.loc[data['Lead Source'] == 'Live Chat', 'Lead Source'] = 'Others'
data.loc[data['Lead Source'] == 'WeLearn', 'Lead Source'] = 'Others'
data.loc[data['Lead Source'] == 'Pay per Click Ads', 'Lead Source'] = 'Others'
data.loc[data['Lead Source'] == 'NC_EDM', 'Lead Source'] = 'Others'
data.loc[data['Lead Source'] == 'blog', 'Lead Source'] = 'Others'
data.loc[data['Lead Source'] == 'testone', 'Lead Source'] = 'Others'
data.loc[data['Lead Source'] == 'welearnblog_Home', 'Lead Source'] = 'Others'
data.loc[data['Lead Source'] == 'youtubechannel', 'Lead Source'] = 'Others'
In [171]:
crosstab(data, ['Lead Source'], 'Converted')
Out[171]:
Lead Source Count Converted Rate(%)
0 Google 2873 0.40
1 Direct Traffic 2543 0.32
2 Olark Chat 1755 0.26
3 Organic Search 1154 0.38
4 Reference 534 0.92
5 Welingak Website 142 0.99
6 Referral Sites 125 0.25
7 Others 59 0.64
8 Facebook 55 0.24

Last Activity

In [172]:
crosstab(data, ['Last Activity'], 'Converted')
Out[172]:
Last Activity Count Converted Rate(%)
0 Email Opened 3437 0.36
1 SMS Sent 2745 0.63
2 Olark Chat Conversation 973 0.09
3 Page Visited on Website 640 0.24
4 Converted to Lead 428 0.13
5 Email Bounced 326 0.08
6 Email Link Clicked 267 0.27
7 Form Submitted on Website 116 0.24
8 Others 103 0.79
9 Unreachable 93 0.33
10 Unsubscribed 61 0.26
11 Had a Phone Conversation 30 0.73
12 Approached upfront 9 1.00
13 View in browser link Clicked 6 0.17
14 Email Received 2 1.00
15 Email Marked Spam 2 1.00
16 Resubscribed to emails 1 1.00
17 Visited Booth in Tradeshow 1 0.00
In [173]:
data.loc[data['Last Activity'] == 'Had a Phone Conversation', 'Last Activity'] = 'Others'
data.loc[data['Last Activity'] == 'Approached upfront', 'Last Activity'] = 'Others'
data.loc[data['Last Activity'] == 'View in browser link Clicked', 'Last Activity'] = 'Others'
data.loc[data['Last Activity'] == 'Email Received', 'Last Activity'] = 'Others'
data.loc[data['Last Activity'] == 'Email Marked Spam', 'Last Activity'] = 'Others'
data.loc[data['Last Activity'] == 'Resubscribed to emails', 'Last Activity'] = 'Others'
data.loc[data['Last Activity'] == 'Visited Booth in Tradeshow', 'Last Activity'] = 'Others'
In [174]:
crosstab(data, ['Last Activity'], 'Converted')
Out[174]:
Last Activity Count Converted Rate(%)
0 Email Opened 3437 0.36
1 SMS Sent 2745 0.63
2 Olark Chat Conversation 973 0.09
3 Page Visited on Website 640 0.24
4 Converted to Lead 428 0.13
5 Email Bounced 326 0.08
6 Email Link Clicked 267 0.27
7 Others 154 0.77
8 Form Submitted on Website 116 0.24
9 Unreachable 93 0.33
10 Unsubscribed 61 0.26

Country

In [187]:
crosstab(data, ['Country'], 'Converted')
Out[187]:
Country Count Converted Rate(%)
0 India 6492 0.37
1 Others 2461 0.44
2 United States 69 0.26
3 United Arab Emirates 53 0.38
4 Singapore 24 0.46
5 Saudi Arabia 21 0.19
6 United Kingdom 15 0.33
7 Australia 13 0.23
8 Qatar 10 0.10
9 Hong Kong 7 0.57
10 Bahrain 7 0.57
11 Oman 6 0.50
12 France 6 0.50
13 unknown 5 0.20
14 South Africa 4 0.25
15 Germany 4 0.25
16 Kuwait 4 0.00
17 Canada 4 0.00
18 Nigeria 4 0.00
19 Sweden 3 0.33
20 Uganda 2 0.00
21 Philippines 2 0.00
22 Asia/Pacific Region 2 0.50
23 Italy 2 0.00
24 Ghana 2 0.00
25 China 2 0.00
26 Belgium 2 0.00
27 Bangladesh 2 0.50
28 Netherlands 2 0.50
29 Malaysia 1 0.00
30 Liberia 1 0.00
31 Russia 1 0.00
32 Kenya 1 0.00
33 Indonesia 1 0.00
34 Sri Lanka 1 0.00
35 Switzerland 1 0.00
36 Tanzania 1 0.00
37 Denmark 1 1.00
38 Vietnam 1 0.00

To be dropped because almost all of the leads come from India.

City

In [189]:
crosstab(data, ['City'], 'Converted')
Out[189]:
City Count Converted Rate(%)
0 Other Cities 4355 0.35
1 Mumbai 3222 0.41
2 Thane & Outskirts 752 0.45
3 Other Cities of Maharashtra 457 0.44
4 Other Metro Cities 380 0.41
5 Tier II Cities 74 0.34

Specialization

In [329]:
crosstab(data, ['Specialization'], 'Converted')
Out[329]:
Specialization Count Converted Rate(%)
0 Others 3380 0.286686
1 Finance Management 976 0.446721
2 Human Resource Management 848 0.457547
3 Marketing Management 838 0.486874
4 Operations Management 503 0.473161
5 Business Administration 403 0.444169
6 IT Projects Management 366 0.382514
7 Supply Chain Management 349 0.432665
8 Banking, Investment And Insurance 338 0.494083
9 Media and Advertising 203 0.418719
10 Travel and Tourism 203 0.354680
11 International Business 178 0.359551
12 Healthcare Management 159 0.496855
13 Hospitality Management 114 0.421053
14 E-Commence 112 0.357143
15 Retail Management 100 0.340000
16 Rural and Agribusiness 73 0.424658
17 E-Business 57 0.368421
18 Services Excellence 40 0.275000
In [330]:
# Reduce redundancy by merging similar categories
data.loc[data['Specialization'] == 'Finance Management', 'Specialization'] = 'Finance'
data.loc[data['Specialization'] == 'Banking, Investment And Insurance', 'Specialization'] = 'Finance'

data.loc[data['Specialization'] == 'Human Resource Management', 'Specialization'] = 'Human Resource'

data.loc[data['Specialization'] == 'Operations Management', 'Specialization'] = 'Operations and Supply Chain'
data.loc[data['Specialization'] == 'Supply Chain Management', 'Specialization'] = 'Operations and Supply Chain'

data.loc[data['Specialization'] == 'Marketing Management', 'Specialization'] = 'Marketing'
data.loc[data['Specialization'] == 'Media and Advertising', 'Specialization'] = 'Marketing'

data.loc[data['Specialization'] == 'Business Administration', 'Specialization'] = 'Business'
data.loc[data['Specialization'] == 'International Business', 'Specialization'] = 'Business'
data.loc[data['Specialization'] == 'E-Commence', 'Specialization'] = 'Business'
data.loc[data['Specialization'] == 'E-Business', 'Specialization'] = 'Business'

data.loc[data['Specialization'] == 'Hospitality Management', 'Specialization'] = 'Tourism and Hospitality'
data.loc[data['Specialization'] == 'Services Excellence', 'Specialization'] = 'Tourism and Hospitality'
data.loc[data['Specialization'] == 'Travel and Tourism', 'Specialization'] = 'Tourism and Hospitality'

data.loc[data['Specialization'] == 'Healthcare Management', 'Specialization'] = 'Healthcare'

data.loc[data['Specialization'] == 'Retail Management', 'Specialization'] = 'Retail'

data.loc[data['Specialization'] == 'IT Projects Management', 'Specialization'] = 'IT'
In [331]:
crosstab(data, ['Specialization'], 'Converted')
Out[331]:
Specialization Count Converted Rate(%)
0 Others 3380 0.286686
1 Finance 1314 0.458904
2 Marketing 1041 0.473583
3 Operations and Supply Chain 852 0.456573
4 Human Resource 848 0.457547
5 Business 750 0.405333
6 IT 366 0.382514
7 Tourism and Hospitality 357 0.366947
8 Healthcare 159 0.496855
9 Retail 100 0.340000
10 Rural and Agribusiness 73 0.424658

How did you hear about Data Science Online Courses

In [190]:
crosstab(data, ['HowHear'], 'Converted')
Out[190]:
HowHear Count Converted Rate(%)
0 Other 7436 0.38
1 Online Search 808 0.42
2 Word Of Mouth 348 0.44
3 Student of SomeSchool 310 0.46
4 Multiple Sources 152 0.37
5 Advertisements 70 0.46
6 Social Media 67 0.42
7 Email 26 0.50
8 SMS 23 0.22
In [191]:
# Reduce redundancy by merging similar categories
data.loc[data['HowHear'] == 'Student of SomeSchool', 'HowHear'] = 'Word Of Mouth'
data.loc[data['HowHear'] == 'Other', 'HowHear'] = 'Others'
In [192]:
crosstab(data, ['HowHear'], 'Converted')
Out[192]:
HowHear Count Converted Rate(%)
0 Others 7436 0.38
1 Online Search 808 0.42
2 Word Of Mouth 658 0.45
3 Multiple Sources 152 0.37
4 Advertisements 70 0.46
5 Social Media 67 0.42
6 Email 26 0.50
7 SMS 23 0.22

What is your current occupation?

In [193]:
crosstab(data, ['CurrentOccupation'], 'Converted')
Out[193]:
CurrentOccupation Count Converted Rate(%)
0 Unemployed 5600 0.44
1 Others 2690 0.14
2 Working Professional 706 0.92
3 Student 210 0.37
4 Other 16 0.62
5 Housewife 10 1.00
6 Businessman 8 0.62
In [194]:
data.loc[data['CurrentOccupation'] == 'Other', 'CurrentOccupation'] = 'Others'
data.loc[data['CurrentOccupation'] == 'Businessman', 'CurrentOccupation'] = 'Working Professional'
data.loc[data['CurrentOccupation'] == 'Housewife', 'CurrentOccupation'] = 'Unemployed'
In [195]:
crosstab(data, ['CurrentOccupation'], 'Converted')
Out[195]:
CurrentOccupation Count Converted Rate(%)
0 Unemployed 5610 0.44
1 Others 2706 0.14
2 Working Professional 714 0.91
3 Student 210 0.37

What matters most to you in choosing this course

In [196]:
crosstab(data, ['WhatMatters'], 'Converted')
Out[196]:
WhatMatters Count Converted Rate(%)
0 Better Career Prospects 6528 0.49
1 Others 2709 0.14
2 Flexibility & Convenience 2 0.50
3 Other 1 0.00
In [197]:
data.loc[data['WhatMatters'] == 'Flexibility & Convenience', 'WhatMatters'] = 'Others'
data.loc[data['WhatMatters'] == 'Other', 'WhatMatters'] = 'Others'
In [198]:
crosstab(data, ['WhatMatters'], 'Converted')
Out[198]:
WhatMatters Count Converted Rate(%)
0 Better Career Prospects 6528 0.49
1 Others 2712 0.14

Tags

In [340]:
crosstab(data, ['Tags'], 'Converted')
Out[340]:
Tags Count Converted Rate(%)
0 Others 3353 0.249329
1 Will revert after reading the email 2072 0.968629
2 Ringing 1203 0.028263
3 Interested in other courses 513 0.025341
4 Already a student 465 0.006452
5 Closed by Horizzon 358 0.994413
6 switched off 240 0.016667
7 Busy 186 0.564516
8 Lost to EINS 175 0.977143
9 Not doing further education 145 0.006897
10 Interested in full time MBA 117 0.025641
11 Graduation in progress 111 0.063063
12 invalid number 83 0.012048
13 Diploma holder (Not Eligible) 63 0.015873
14 wrong number given 47 0.000000
15 opp hangup 33 0.090909
16 number not provided 27 0.000000
17 in touch with EINS 12 0.250000
18 Lost to Others 7 0.000000
19 Still Thinking 6 0.166667
20 Want to take admission but has financial problems 6 0.333333
21 In confusion whether part time or DLP 5 0.200000
22 Interested in Next batch 5 1.000000
23 Lateral student 3 1.000000
24 University not recognized 2 0.000000
25 Shall take in the next coming month 2 0.500000
26 Recognition issue (DEC approval) 1 0.000000
In [341]:
# Reduce redundancy by merging similar categories
data.loc[data['Tags'] == 'wrong number given', 'Tags'] = 'invalid number or not provided'
data.loc[data['Tags'] == 'number not provided', 'Tags'] = 'invalid number or not provided'
data.loc[data['Tags'] == 'Already a student', 'Tags'] = 'Current Student'
data.loc[data['Tags'] == 'Graduation in progress', 'Tags'] = 'Current Student'
data.loc[data['Tags'] == 'Lateral student', 'Tags'] = 'Current Student'


data.loc[data['Tags'] == 'switched off', 'Tags'] = 'No Response'
data.loc[data['Tags'] == 'Busy', 'Tags'] = 'No Response'
data.loc[data['Tags'] == 'opp hangup', 'Tags'] = 'No Response'



data.loc[data['Tags'] == 'Interested in Next batch', 'Tags'] = 'Insterested'
data.loc[data['Tags'] == 'Shall take in the next coming month', 'Tags'] = 'Insterested'

data.loc[data['Tags'] == 'In confusion whether part time or DLP', 'Tags'] = 'Have Question'
data.loc[data['Tags'] == 'Want to take admission but has financial problems', 'Tags'] = 'Have Question'

data.loc[data['Tags'] == 'Recognition issue (DEC approval)', 'Tags'] = 'Have Question'
data.loc[data['Tags'] == 'University not recognized', 'Tags'] = 'Have Question'

data.loc[data['Tags'] == 'invalid number', 'Tags'] = 'invalid number or not provided'

data.loc[data['Tags'] == 'Interested in other courses', 'Tags'] = 'Insterested'
data.loc[data['Tags'] == 'Still Thinking', 'Tags'] = 'Insterested'
data.loc[data['Tags'] == 'Lost to EINS', 'Tags'] = 'Lost'
data.loc[data['Tags'] == 'Lost to Others', 'Tags'] = 'Lost'
In [342]:
crosstab(data, ['Tags'], 'Converted')
Out[342]:
Tags Count Converted Rate(%)
0 Others 3353 0.249329
1 Will revert after reading the email 2072 0.968629
2 Ringing 1203 0.028263
3 Current Student 579 0.022453
4 Insterested 526 0.038023
5 No Response 459 0.244009
6 Closed by Horizzon 358 0.994413
7 Lost 182 0.939560
8 invalid number or not provided 157 0.006369
9 Not doing further education 145 0.006897
10 Interested in full time MBA 117 0.025641
11 Diploma holder (Not Eligible) 63 0.015873
12 Have Question 14 0.214286
13 in touch with EINS 12 0.250000

Lead Quality

In [199]:
crosstab(data, ['Lead Quality'], 'Converted')
Out[199]:
Lead Quality Count Converted Rate(%)
0 Others 4767 0.21
1 Might be 1560 0.76
2 Not Sure 1092 0.24
3 High in Relevance 637 0.95
4 Worst 601 0.02
5 Low in Relevance 583 0.82

Lead Profile

In [200]:
crosstab(data, ['Lead Profile'], 'Converted')
Out[200]:
Lead Profile Count Converted Rate(%)
0 Select 4146 0.41
1 Others 2709 0.14
2 Potential Lead 1613 0.79
3 Other Leads 487 0.37
4 Student of SomeSchool 241 0.04
5 Lateral Student 24 0.96
6 Dual Specialization Student 20 1.00
In [201]:
data.loc[data['Lead Profile'] == 'Select', 'Lead Profile'] = 'Others'
data.loc[data['Lead Profile'] == 'Other Leads', 'Lead Profile'] = 'Others'
In [202]:
crosstab(data, ['Lead Profile'], 'Converted')
Out[202]:
Lead Profile Count Converted Rate(%)
0 Others 7342 0.31
1 Potential Lead 1613 0.79
2 Student of SomeSchool 241 0.04
3 Lateral Student 24 0.96
4 Dual Specialization Student 20 1.00

A free copy of Mastering The Case Study

In [203]:
crosstab(data, ['FreeCopy'], 'Converted')
Out[203]:
FreeCopy Count Converted Rate(%)
0 0 6352 0.40
1 1 2888 0.36

Last Notable Activity

In [204]:
crosstab(data, ['LastNotableActivity'], 'Converted')
Out[204]:
LastNotableActivity Count Converted Rate(%)
0 Modified 3407 0.23
1 Email Opened 2827 0.37
2 SMS Sent 2172 0.69
3 Page Visited on Website 318 0.29
4 Olark Chat Conversation 183 0.14
5 Email Link Clicked 173 0.26
6 Email Bounced 60 0.15
7 Unsubscribed 47 0.30
8 Unreachable 32 0.69
9 Had a Phone Conversation 14 0.93
10 Email Marked Spam 2 1.00
11 Approached upfront 1 1.00
12 Email Received 1 1.00
13 Form Submitted on Website 1 0.00
14 Resubscribed to emails 1 1.00
15 View in browser link Clicked 1 0.00

Last Activity and LastNotableActivity are similar columns so delete the latter one.
Delete Country, WhatMatters, FreeCopy as well because these columns have nearly one value and they are useless for further analysis.

In [207]:
drop_list3 = ['LastNotableActivity', 'Country', 'WhatMatters', 'FreeCopy']
data = data.drop(columns = drop_list3)
In [208]:
data.shape
Out[208]:
(9240, 27)
In [209]:
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9240 entries, 1 to 9240
Data columns (total 27 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Lead Origin              9240 non-null   object 
 1   Lead Source              9240 non-null   object 
 2   Do Not Email             9240 non-null   int64  
 3   Do Not Call              9240 non-null   int64  
 4   Converted                9240 non-null   int64  
 5   TotalVisits              9240 non-null   float64
 6   TotalTime                9240 non-null   int64  
 7   PPV                      9240 non-null   float64
 8   Last Activity            9240 non-null   object 
 9   Specialization           9240 non-null   object 
 10  HowHear                  9240 non-null   object 
 11  CurrentOccupation        9240 non-null   object 
 12  Search                   9240 non-null   int64  
 13  Magazine                 9240 non-null   int64  
 14  Newspaper Article        9240 non-null   int64  
 15  DSForums                 9240 non-null   int64  
 16  Newspaper                9240 non-null   int64  
 17  Digital Advertisement    9240 non-null   int64  
 18  Through Recommendations  9240 non-null   int64  
 19  ReceiveUpdates           9240 non-null   int64  
 20  Tags                     9240 non-null   object 
 21  Lead Quality             9240 non-null   object 
 22  UpdateDScontent          9240 non-null   int64  
 23  UpdateDMcontent          9240 non-null   int64  
 24  Lead Profile             9240 non-null   object 
 25  City                     9240 non-null   object 
 26  PayThrough               9240 non-null   int64  
dtypes: float64(2), int64(15), object(10)
memory usage: 2.3+ MB
In [299]:
# Store the newly cleaned data file
# data.to_csv('Newly Cleaned Data Science Online Courses.csv')
In [210]:
data = pd.read_csv('Newly Cleaned Data Science Online Courses.csv', index_col = 0)

2.5 Correlation Check

In [211]:
sns.heatmap(data.corr(),annot=True,cmap='RdYlGn',linewidths=0.2)
fig=plt.gcf()
fig.set_size_inches(10,8)
plt.show()

Altough TotalVisits and PPV are highly correlated(0.77), I decide not to drop them now because they are very important metrics for further analysis.
TotalVisits represent the exposure while PPV shows the enagement. PPV is particularly useful for online course website performance evaluation because visitors tend to make purchases when PPV is high.
I will wait to see if there is overfitting problem.
Actually in the later logistic regression, PPV is dropped eventually due to high P value.

3. Data Processing

3.1 One-Hot Encoding

In [212]:
# State the character columns 
categ_cols = data.columns[data.dtypes == np.dtype('O')].to_list()
In [213]:
categ_cols
Out[213]:
['Lead Origin',
 'Lead Source',
 'Last Activity',
 'Specialization',
 'HowHear',
 'CurrentOccupation',
 'Tags',
 'Lead Quality',
 'Lead Profile',
 'City']
In [214]:
onehot_data = pd.get_dummies(data[categ_cols], drop_first = True)
data = pd.concat([data, onehot_data], axis=1)
In [215]:
data = data.drop(categ_cols, axis=1)
data.head()
Out[215]:
Converted TotalVisits TotalTime PPV Lead Origin_Landing Page Submission Lead Origin_Lead Add Form Lead Origin_Lead Import&Quick Add Form Lead Source_Facebook Lead Source_Google Lead Source_Olark Chat ... Lead Quality_Worst Lead Profile_Lateral Student Lead Profile_Others Lead Profile_Potential Lead Lead Profile_Student of SomeSchool City_Other Cities City_Other Cities of Maharashtra City_Other Metro Cities City_Thane & Outskirts City_Tier II Cities
1 0 0.00 0 0.00 0 0 0 0 0 1 ... 0 0 1 0 0 1 0 0 0 0
2 0 5.00 674 2.50 0 0 0 0 0 0 ... 0 0 1 0 0 1 0 0 0 0
3 1 2.00 1532 2.00 1 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 0
4 0 1.00 305 1.00 1 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 0
5 1 2.00 1428 1.00 1 0 0 0 1 0 ... 0 0 1 0 0 0 0 0 0 0

5 rows × 72 columns

In [216]:
print('The size of the total dataset is {}'.format(data.shape))
The size of the total dataset is (9240, 72)

3.2 Splitting Dataset into Training and Test Set

In [217]:
from sklearn.model_selection import train_test_split
X = data.drop(columns = ['Converted'])
y = data['Converted']

X_train, X_test, y_train, y_test, data_train, data_test = train_test_split(
    X,
    y,
    data,
    test_size = 0.2,
    stratify = y,
    shuffle = True
)
In [218]:
print('Shape of X_train: {}'.format(X_train.shape))
print('Shape of X_test: {}'.format(X_test.shape))
Shape of X_train: (7392, 71)
Shape of X_test: (1848, 71)

3.3 Feature Scaling

In [219]:
import warnings
warnings.filterwarnings('ignore')
In [220]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
numeric_features = ['TotalVisits',
                    'TotalTime',
                    'PPV']

X_train[numeric_features] = scaler.fit_transform(X_train[numeric_features])
X_test[numeric_features] = scaler.fit_transform(X_test[numeric_features])

4 Data Modeling

Functions for Perfomance Metric:

I assume that the current objective is to sell diploma-related courses.
To not waste any sales resources, I prefer a classifier that rejects many leads (relatively low recall) but keeps only the best true hot leads (high precision).

In [223]:
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
def flat_accuracy(pred, true):
    acc = (pred == y_test).sum() / y_test.shape[0]
    print('Accuracy = %.4f'% (acc*100) + '%')
        
def flat_precision(pred, true):
    precision = precision_score(true, pred, average='macro') # Due to unbalance, we prefer to use macro recall
    print('Precision = %.4f'% (precision*100) + '%')
    
def flat_recall(pred, true):
    recall = recall_score(true, pred, average='macro') # Due to unbalance, we prefer to use macro recall
    print('Recall = %.4f'% (recall*100) + '%')
    
def flat_f1(pred, true):
    f1 = f1_score(true, pred, average='macro')
    print('F1 = %.4f'% (f1*100) + '%')
In [224]:
from sklearn.metrics import confusion_matrix as cm
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import plot_precision_recall_curve
from sklearn.metrics import average_precision_score

def plot_cm(true, pred, model=''):
    # Confusion matrix
    cm_matrix = cm(true, pred)

    # Accuracy
    accuracy = (true == pred).sum() / len(true)
    plt.clf()
    plt.imshow(cm_matrix, interpolation='nearest', cmap=plt.cm.Wistia)
    classNames = ['0', '1']
    plt.title('Validation Data %s - Accuracy %.4f' % (model, accuracy))
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    tick_marks = np.arange(len(classNames))
    plt.xticks(tick_marks, classNames, rotation=45)
    plt.yticks(tick_marks, classNames)
    for i in range(2):
        for j in range(2):
            plt.text(j,i, str(cm_matrix[i][j]))
    plt.show()
    
import matplotlib.pyplot as plt
def plot_prc(classifier, X_test, y_test, y_pred):
    average_precision = average_precision_score(y_test, y_pred)
    disp = plot_precision_recall_curve(classifier, X_test, y_test)
    disp.ax_.set_title('2-class Precision-Recall curve: '
                       'AP={0:0.2f}'.format(average_precision))

4.1 CatBoost

4.11 Model Building

In [89]:
from catboost import CatBoostClassifier;
# We don't need to put any parameters. 
# To suppress CatBoost iteration results and make this files readable, 
# I put 'Silent' in logging_level
classifier_CatBoost = CatBoostClassifier(logging_level = 'Silent') 
classifier_CatBoost.fit(X_train, y_train)
y_pred_CatBoost = classifier_CatBoost.predict(X_test)
In [121]:
flat_accuracy(y_pred_CatBoost, y_test)
flat_precision(y_pred_CatBoost, y_test)
flat_recall(y_pred_CatBoost, y_test)
flat_f1(y_pred_CatBoost, y_test)
Accuracy = 92.4784%
Precision = 92.3685%
Recall = 91.6803%
F1 = 91.9975%
In [122]:
plot_cm(y_pred_CatBoost, y_test, 'CatBoost')
In [92]:
plot_prc(classifier_CatBoost, X_test, y_test, y_pred_CatBoost)
In [149]:
from sklearn.metrics import roc_auc_score
roc_auc_score(y_test, y_pred_CatBoost)
Out[149]:
0.916803489476183

Take a look at the feature importance

In [93]:
featureImp = []
for feat, importance in zip(X.columns, classifier_CatBoost.feature_importances_):  
    temp = [feat, importance]
    featureImp.append(temp)

fT_df = pd.DataFrame(featureImp, columns = ['Feature', 'Importance'])
fT_df.sort_values('Importance', ascending = False).head(15)
Out[93]:
Feature Importance
54 Tags_Will revert after reading the email 16.72
1 TotalTime 9.97
53 Tags_Ringing 8.59
21 Last Activity_SMS Sent 6.41
42 CurrentOccupation_Unemployed 5.19
4 Lead Origin_Lead Add Form 4.19
52 Tags_Others 4.10
61 Lead Quality_Worst 3.69
47 Tags_Insterested 3.64
0 TotalVisits 3.46
2 PPV 3.10
44 Tags_Current Student 3.09
49 Tags_Lost 2.38
60 Lead Quality_Others 2.16
16 Last Activity_Email Opened 1.84

4.12 K-Fold Cross Validation

In [94]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier_CatBoost, X = X_train, y = y_train, cv = 10)
In [95]:
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))
Accuracy: 92.98 %
Standard Deviation: 0.84 %

4.2 Random Forest

In [98]:
from sklearn.ensemble import RandomForestClassifier
Classifier_RF = RandomForestClassifier(criterion='entropy', n_estimators=200, n_jobs = -1)
In [99]:
parameters_RF = {'max_depth': [14,16,18],
            'min_samples_leaf': [1,2]}
In [100]:
from sklearn.model_selection import GridSearchCV
grid_search_RF = GridSearchCV(Classifier_RF, parameters_RF, cv=5, verbose=2, n_jobs=-1,
                              scoring='roc_auc_ovr_weighted')
In [101]:
grid_search_RF.fit(X_train, y_train)
Fitting 5 folds for each of 6 candidates, totalling 30 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:    8.1s finished
Out[101]:
GridSearchCV(cv=5, error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='entropy',
                                              max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=200, n_jobs=-1,
                                              oob_score=False,
                                              random_state=None, verbose=0,
                                              warm_start=False),
             iid='deprecated', n_jobs=-1,
             param_grid={'max_depth': [14, 16, 18], 'min_samples_leaf': [1, 2]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='roc_auc_ovr_weighted', verbose=2)
In [102]:
grid_search_RF.best_params_
Out[102]:
{'max_depth': 16, 'min_samples_leaf': 1}

4.22 Model Building

In [123]:
Classifier_RF = RandomForestClassifier(criterion='entropy', n_estimators=200, min_samples_split=4,
                                       max_depth=16, min_samples_leaf=1, n_jobs = -1)
In [124]:
Classifier_RF.fit(X_train, y_train)
y_pred_RF = Classifier_RF.predict(X_test)
In [125]:
flat_accuracy(y_pred_RF, y_test)
flat_precision(y_pred_RF, y_test)
flat_recall(y_pred_RF, y_test)
flat_f1(y_pred_RF, y_test)
Accuracy = 92.3701%
Precision = 92.3812%
Recall = 91.4351%
F1 = 91.8591%
In [107]:
plot_cm(y_pred_RF, y_test, 'Random Forest')
In [225]:
plot_prc(Classifier_RF, X_test, y_test, y_pred_RF)

4.23 K-Fold Cross Validation

In [108]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = Classifier_RF, X = X_train, y = y_train, cv = 10)
In [109]:
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))
Accuracy: 92.59 %
Standard Deviation: 0.85 %

4.3 XGBoost

In [110]:
from xgboost import XGBClassifier
Classifier_XGBoost = XGBClassifier(booster = 'gbtree')
In [111]:
parameters_XG = {'n_estimators': range(200, 500, 700),
                 'max_depth': [35,45,55],
                 'learning_rate': [0.1, 0.01, 0.05]
                }
In [112]:
grid_search_XG = GridSearchCV(Classifier_XGBoost, parameters_XG, cv=5, verbose=2, n_jobs=-1,
                              scoring='roc_auc_ovr_weighted')
In [113]:
grid_search_XG.fit(X_train, y_train)
Fitting 5 folds for each of 9 candidates, totalling 45 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done  45 out of  45 | elapsed:  2.6min finished
Out[113]:
GridSearchCV(cv=5, error_score=nan,
             estimator=XGBClassifier(base_score=None, booster='gbtree',
                                     colsample_bylevel=None,
                                     colsample_bynode=None,
                                     colsample_bytree=None, gamma=None,
                                     gpu_id=None, importance_type='gain',
                                     interaction_constraints=None,
                                     learning_rate=None, max_delta_step=None,
                                     max_depth=None, min_child_weight=None,
                                     missing=nan, monotone_constraints=None,
                                     n_e...
                                     random_state=None, reg_alpha=None,
                                     reg_lambda=None, scale_pos_weight=None,
                                     subsample=None, tree_method=None,
                                     validate_parameters=None, verbosity=None),
             iid='deprecated', n_jobs=-1,
             param_grid={'learning_rate': [0.1, 0.01, 0.05],
                         'max_depth': [35, 45, 55],
                         'n_estimators': range(200, 500, 700)},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='roc_auc_ovr_weighted', verbose=2)
In [114]:
grid_search_XG.best_params_
Out[114]:
{'learning_rate': 0.05, 'max_depth': 35, 'n_estimators': 200}

4.32 Model Building

In [117]:
Classifier_XG = XGBClassifier(booster = 'gbtree', max_depth=35, n_estimators=200,
                              learning_rate=0.05, n_jobs=-1)
In [118]:
Classifier_XG.fit(X_train, y_train)
y_pred_XG = Classifier_XG.predict(X_test)
In [119]:
flat_accuracy(y_pred_XG, y_test)
flat_precision(y_pred_XG, y_test)
flat_recall(y_pred_XG, y_test)
flat_f1(y_pred_XG, y_test)
Accuracy = 92.2619%
Precision = 91.8408%
Recall = 91.8188%
F1 = 91.8298%
In [120]:
plot_cm(y_pred_XG, y_test, 'Random Forest')
In [226]:
plot_prc(Classifier_XG, X_test, y_test, y_pred_XG)

4.33 K-Fold Cross Validation

In [126]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = Classifier_XG,  X = X_train, y = y_train, cv = 10)
In [127]:
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))
Accuracy: 92.56 %
Standard Deviation: 0.69 %

4.4 Logistic Regression

Strategies:

  • Put all features in the model first and then drop features that are not significantly related to "converted" based on VIF and P value.
  • Try different models with various features until we get the best one.
In [25]:
import statsmodels.api as sm
In [26]:
from sklearn.linear_model import LogisticRegression
classifier_LR = LogisticRegression(random_state = 0)

from sklearn.feature_selection import RFE
rfe = RFE(classifier_LR, 200) # running RFE with 200 variables as output
rfe = rfe.fit(X_train, y_train)
In [27]:
list(zip(X_train.columns, rfe.support_, rfe.ranking_))
col = X_train.columns[rfe.support_]
X_train.columns[~rfe.support_];

4.41 1st Try

In [28]:
X_train_sm_1 = sm.add_constant(X_train[col])
lr1 = sm.GLM(y_train, X_train_sm_1, family = sm.families.Binomial())
res = lr1.fit()
res.summary()
Out[28]:
Generalized Linear Model Regression Results
Dep. Variable: Converted No. Observations: 7392
Model: GLM Df Residuals: 7320
Model Family: Binomial Df Model: 71
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -1328.6
Date: Sun, 14 Mar 2021 Deviance: 2657.3
Time: 14:43:50 Pearson chi2: 9.52e+03
No. Iterations: 25
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const 24.3136 7.88e+04 0.000 1.000 -1.54e+05 1.54e+05
TotalVisits 0.3200 0.084 3.801 0.000 0.155 0.485
TotalTime 1.1394 0.061 18.722 0.000 1.020 1.259
PPV -0.4094 0.097 -4.237 0.000 -0.599 -0.220
Lead Origin_Landing Page Submission -0.2853 0.241 -1.186 0.236 -0.757 0.186
Lead Origin_Lead Add Form -0.3028 1.053 -0.287 0.774 -2.367 1.762
Lead Origin_Lead Import&Quick Add Form 19.3663 2.32e+04 0.001 0.999 -4.55e+04 4.55e+04
Lead Source_Facebook -19.9211 2.32e+04 -0.001 0.999 -4.55e+04 4.55e+04
Lead Source_Google 0.0830 0.180 0.462 0.644 -0.269 0.435
Lead Source_Olark Chat 1.0286 0.243 4.225 0.000 0.551 1.506
Lead Source_Organic Search 0.2457 0.201 1.221 0.222 -0.149 0.640
Lead Source_Others 1.4073 0.891 1.579 0.114 -0.340 3.155
Lead Source_Reference 0.7384 1.107 0.667 0.505 -1.431 2.907
Lead Source_Referral Sites -0.4825 0.536 -0.901 0.368 -1.532 0.567
Lead Source_Welingak Website 4.1759 1.283 3.255 0.001 1.661 6.691
Last Activity_Email Bounced -0.4875 0.508 -0.959 0.338 -1.484 0.509
Last Activity_Email Link Clicked 1.0632 0.458 2.321 0.020 0.165 1.961
Last Activity_Email Opened 1.5365 0.331 4.638 0.000 0.887 2.186
Last Activity_Form Submitted on Website 0.9166 0.628 1.460 0.144 -0.314 2.147
Last Activity_Olark Chat Conversation -0.0700 0.386 -0.181 0.856 -0.827 0.687
Last Activity_Others 0.6195 0.620 1.000 0.317 -0.595 1.834
Last Activity_Page Visited on Website 0.4748 0.398 1.192 0.233 -0.306 1.255
Last Activity_SMS Sent 3.2645 0.336 9.704 0.000 2.605 3.924
Last Activity_Unreachable 1.3745 0.660 2.081 0.037 0.080 2.669
Last Activity_Unsubscribed 1.4585 0.729 2.001 0.045 0.030 2.887
Specialization_Finance -0.0766 0.232 -0.331 0.741 -0.531 0.378
Specialization_Healthcare -0.2029 0.471 -0.431 0.666 -1.125 0.720
Specialization_Human Resource -0.2241 0.254 -0.881 0.378 -0.723 0.275
Specialization_IT -0.0781 0.327 -0.239 0.811 -0.718 0.562
Specialization_Marketing 0.0899 0.238 0.378 0.706 -0.377 0.556
Specialization_Operations and Supply Chain -0.4532 0.258 -1.757 0.079 -0.959 0.052
Specialization_Others -0.1765 0.278 -0.635 0.526 -0.721 0.368
Specialization_Retail -0.5853 0.534 -1.096 0.273 -1.632 0.461
Specialization_Rural and Agribusiness 0.1232 0.622 0.198 0.843 -1.095 1.341
Specialization_Tourism and Hospitality -0.8777 0.333 -2.638 0.008 -1.530 -0.226
HowHear_Email 0.5036 1.156 0.436 0.663 -1.763 2.770
HowHear_Multiple Sources -0.6186 0.649 -0.954 0.340 -1.890 0.653
HowHear_Online Search -0.4594 0.561 -0.819 0.413 -1.559 0.640
HowHear_Others -0.4943 0.551 -0.898 0.369 -1.573 0.585
HowHear_SMS -0.3027 1.016 -0.298 0.766 -2.293 1.688
HowHear_Social Media -0.1248 0.854 -0.146 0.884 -1.798 1.548
HowHear_Word Of Mouth 0.0643 0.565 0.114 0.909 -1.043 1.172
CurrentOccupation_Student 2.4660 0.496 4.974 0.000 1.494 3.438
CurrentOccupation_Unemployed 2.5735 0.147 17.541 0.000 2.286 2.861
CurrentOccupation_Working Professional 2.9481 0.333 8.844 0.000 2.295 3.601
Tags_Current Student -7.9293 0.832 -9.527 0.000 -9.561 -6.298
Tags_Diploma holder (Not Eligible) -9.1651 1.391 -6.589 0.000 -11.891 -6.439
Tags_Have Question -6.9366 1.356 -5.117 0.000 -9.594 -4.280
Tags_Insterested -8.1874 0.816 -10.036 0.000 -9.786 -6.589
Tags_Interested in full time MBA -8.5061 1.107 -7.684 0.000 -10.676 -6.336
Tags_Lost -0.3686 0.861 -0.428 0.669 -2.056 1.319
Tags_No Response -7.1632 0.788 -9.090 0.000 -8.708 -5.619
Tags_Not doing further education -8.7396 1.282 -6.819 0.000 -11.252 -6.227
Tags_Others -4.1304 0.777 -5.319 0.000 -5.652 -2.608
Tags_Ringing -9.8212 0.804 -12.215 0.000 -11.397 -8.245
Tags_Will revert after reading the email -1.8915 0.757 -2.498 0.012 -3.376 -0.407
Tags_in touch with EINS -6.7151 1.434 -4.682 0.000 -9.526 -3.904
Tags_invalid number or not provided -46.3466 3.29e+04 -0.001 0.999 -6.45e+04 6.44e+04
Lead Quality_Low in Relevance -0.9031 0.467 -1.933 0.053 -1.819 0.013
Lead Quality_Might be -1.0661 0.447 -2.387 0.017 -1.941 -0.191
Lead Quality_Not Sure -0.0750 0.474 -0.158 0.874 -1.004 0.855
Lead Quality_Others -1.2008 0.477 -2.519 0.012 -2.135 -0.267
Lead Quality_Worst -2.2408 0.863 -2.597 0.009 -3.932 -0.549
Lead Profile_Lateral Student -18.5193 7.88e+04 -0.000 1.000 -1.54e+05 1.54e+05
Lead Profile_Others -22.4474 7.88e+04 -0.000 1.000 -1.54e+05 1.54e+05
Lead Profile_Potential Lead -21.9559 7.88e+04 -0.000 1.000 -1.54e+05 1.54e+05
Lead Profile_Student of SomeSchool -24.1017 7.88e+04 -0.000 1.000 -1.54e+05 1.54e+05
City_Other Cities 0.1189 0.190 0.625 0.532 -0.254 0.492
City_Other Cities of Maharashtra 0.0959 0.252 0.381 0.703 -0.397 0.589
City_Other Metro Cities 0.0846 0.287 0.295 0.768 -0.477 0.646
City_Thane & Outskirts -0.0418 0.218 -0.191 0.848 -0.470 0.386
City_Tier II Cities 0.6024 0.519 1.162 0.245 -0.414 1.619
In [29]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
VIF_rank = pd.DataFrame()
VIF_rank['Features'] = X_train[col].columns
VIF_rank['VIF'] = [variance_inflation_factor(X_train[col].values, i) 
                                    for i in range(X_train[col].shape[1])]
VIF_rank['VIF'] = round(VIF_rank['VIF'], 2)
VIF_rank = VIF_rank.sort_values(by = "VIF", ascending = False)
VIF_rank.head(30)
Out[29]:
Features VIF
63 Lead Profile_Others 137.92
37 HowHear_Others 91.55
4 Lead Origin_Lead Add Form 45.39
11 Lead Source_Reference 34.38
64 Lead Profile_Potential Lead 29.09
52 Tags_Others 19.93
60 Lead Quality_Others 19.64
5 Lead Origin_Lead Import&Quick Add Form 15.55
6 Lead Source_Facebook 15.47
36 HowHear_Online Search 10.64
13 Lead Source_Welingak Website 10.34
30 Specialization_Others 10.11
3 Lead Origin_Landing Page Submission 10.07
16 Last Activity_Email Opened 9.63
42 CurrentOccupation_Unemployed 9.50
40 HowHear_Word Of Mouth 8.89
21 Last Activity_SMS Sent 8.40
54 Tags_Will revert after reading the email 8.30
53 Tags_Ringing 6.63
65 Lead Profile_Student of SomeSchool 6.52
66 City_Other Cities 5.64
58 Lead Quality_Might be 4.90
59 Lead Quality_Not Sure 4.82
8 Lead Source_Olark Chat 4.55
44 Tags_Current Student 4.32
61 Lead Quality_Worst 4.21
18 Last Activity_Olark Chat Conversation 3.80
7 Lead Source_Google 3.67
47 Tags_Insterested 3.40
2 PPV 3.40
In [30]:
# pd.set_option('display.max_rows', None);
In [31]:
col1 = col.drop(['Lead Profile_Others','HowHear_Others', 'Lead Source_Facebook', 
                'Lead Origin_Lead Import&Quick Add Form', 'Lead Origin_Lead Add Form',
                 'Lead Source_Reference', 'Lead Profile_Potential Lead',
                'Tags_Others', 'Lead Quality_Others', 'Specialization_Others'], 1)

4.42 2nd Try

In [32]:
X_train_sm_2 = sm.add_constant(X_train[col1])
lr2 = sm.GLM(y_train, X_train_sm_2, family = sm.families.Binomial())
res = lr2.fit()
res.summary()
Out[32]:
Generalized Linear Model Regression Results
Dep. Variable: Converted No. Observations: 7392
Model: GLM Df Residuals: 7330
Model Family: Binomial Df Model: 61
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -1426.3
Date: Sun, 14 Mar 2021 Deviance: 2852.7
Time: 14:43:53 Pearson chi2: 1.01e+04
No. Iterations: 24
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const -3.8294 0.390 -9.812 0.000 -4.594 -3.065
TotalVisits 0.3054 0.079 3.853 0.000 0.150 0.461
TotalTime 1.0808 0.057 18.806 0.000 0.968 1.193
PPV -0.5264 0.091 -5.760 0.000 -0.705 -0.347
Lead Origin_Landing Page Submission -0.6585 0.180 -3.667 0.000 -1.011 -0.307
Lead Source_Google -0.2565 0.154 -1.661 0.097 -0.559 0.046
Lead Source_Olark Chat 0.2560 0.189 1.352 0.176 -0.115 0.627
Lead Source_Organic Search 0.0731 0.188 0.390 0.697 -0.294 0.441
Lead Source_Others 0.6197 0.666 0.931 0.352 -0.685 1.925
Lead Source_Referral Sites -0.8064 0.497 -1.622 0.105 -1.781 0.168
Lead Source_Welingak Website 2.8762 0.748 3.847 0.000 1.411 4.342
Last Activity_Email Bounced -0.1609 0.470 -0.342 0.732 -1.082 0.760
Last Activity_Email Link Clicked 1.7346 0.424 4.088 0.000 0.903 2.566
Last Activity_Email Opened 1.9146 0.328 5.845 0.000 1.273 2.557
Last Activity_Form Submitted on Website 1.2026 0.599 2.008 0.045 0.029 2.376
Last Activity_Olark Chat Conversation 0.2426 0.381 0.637 0.524 -0.504 0.989
Last Activity_Others 2.0172 0.541 3.726 0.000 0.956 3.078
Last Activity_Page Visited on Website 1.0523 0.380 2.767 0.006 0.307 1.798
Last Activity_SMS Sent 3.5479 0.335 10.606 0.000 2.892 4.204
Last Activity_Unreachable 1.9328 0.602 3.209 0.001 0.752 3.113
Last Activity_Unsubscribed 1.8380 0.698 2.632 0.008 0.469 3.207
Specialization_Finance 0.1900 0.188 1.010 0.313 -0.179 0.559
Specialization_Healthcare 0.3970 0.392 1.014 0.311 -0.370 1.164
Specialization_Human Resource 0.1316 0.208 0.634 0.526 -0.275 0.538
Specialization_IT 0.0827 0.293 0.282 0.778 -0.491 0.656
Specialization_Marketing 0.2801 0.196 1.427 0.154 -0.105 0.665
Specialization_Operations and Supply Chain -0.1257 0.210 -0.599 0.549 -0.537 0.286
Specialization_Retail -0.1417 0.483 -0.293 0.769 -1.088 0.804
Specialization_Rural and Agribusiness 0.3988 0.589 0.677 0.498 -0.756 1.553
Specialization_Tourism and Hospitality -0.5240 0.292 -1.793 0.073 -1.097 0.049
HowHear_Email 1.1764 1.012 1.163 0.245 -0.807 3.160
HowHear_Multiple Sources -0.0333 0.383 -0.087 0.931 -0.783 0.717
HowHear_Online Search 0.2423 0.210 1.154 0.248 -0.169 0.654
HowHear_SMS 0.3216 0.879 0.366 0.714 -1.400 2.044
HowHear_Social Media 0.4243 0.642 0.661 0.509 -0.834 1.683
HowHear_Word Of Mouth 0.6819 0.214 3.189 0.001 0.263 1.101
CurrentOccupation_Student 2.8189 0.446 6.314 0.000 1.944 3.694
CurrentOccupation_Unemployed 3.2378 0.140 23.181 0.000 2.964 3.512
CurrentOccupation_Working Professional 4.1411 0.299 13.855 0.000 3.555 4.727
Tags_Current Student -4.2717 0.371 -11.525 0.000 -4.998 -3.545
Tags_Diploma holder (Not Eligible) -5.7073 1.206 -4.731 0.000 -8.072 -3.343
Tags_Have Question -3.0429 1.123 -2.710 0.007 -5.243 -0.842
Tags_Insterested -4.6164 0.339 -13.610 0.000 -5.281 -3.952
Tags_Interested in full time MBA -5.0942 0.818 -6.225 0.000 -6.698 -3.490
Tags_Lost 3.5110 0.415 8.461 0.000 2.698 4.324
Tags_No Response -3.6096 0.263 -13.703 0.000 -4.126 -3.093
Tags_Not doing further education -5.3006 1.047 -5.065 0.000 -7.352 -3.249
Tags_Ringing -6.2504 0.298 -20.997 0.000 -6.834 -5.667
Tags_Will revert after reading the email 1.6597 0.218 7.616 0.000 1.233 2.087
Tags_in touch with EINS -3.2024 1.208 -2.651 0.008 -5.570 -0.835
Tags_invalid number or not provided -27.3871 1.68e+04 -0.002 0.999 -3.29e+04 3.29e+04
Lead Quality_Low in Relevance 0.5226 0.277 1.888 0.059 -0.020 1.065
Lead Quality_Might be 0.0911 0.223 0.409 0.683 -0.346 0.528
Lead Quality_Not Sure 1.0367 0.239 4.346 0.000 0.569 1.504
Lead Quality_Worst -1.2047 0.716 -1.682 0.093 -2.608 0.199
Lead Profile_Lateral Student 2.9320 1.402 2.091 0.037 0.183 5.681
Lead Profile_Student of SomeSchool -1.2769 1.018 -1.254 0.210 -3.273 0.719
City_Other Cities -0.1826 0.164 -1.112 0.266 -0.505 0.139
City_Other Cities of Maharashtra 0.0442 0.235 0.188 0.851 -0.417 0.505
City_Other Metro Cities 0.0035 0.278 0.013 0.990 -0.540 0.548
City_Thane & Outskirts 0.0328 0.200 0.164 0.870 -0.359 0.424
City_Tier II Cities 0.5709 0.505 1.130 0.259 -0.420 1.562
In [33]:
VIF_rank = pd.DataFrame()
VIF_rank['Features'] = X_train[col1].columns
VIF_rank['VIF'] = [variance_inflation_factor(X_train[col1].values, i) 
                   for i in range(X_train[col1].shape[1])]
VIF_rank['VIF'] = round(VIF_rank['VIF'], 2)
VIF_rank = VIF_rank.sort_values(by = "VIF", ascending = False)
VIF_rank.head(20)
Out[33]:
Features VIF
36 CurrentOccupation_Unemployed 7.54
3 Lead Origin_Landing Page Submission 5.51
12 Last Activity_Email Opened 4.97
17 Last Activity_SMS Sent 4.56
47 Tags_Will revert after reading the email 4.18
56 City_Other Cities 4.10
2 PPV 3.25
5 Lead Source_Olark Chat 3.01
0 TotalVisits 2.76
4 Lead Source_Google 2.72
46 Tags_Ringing 2.57
53 Lead Quality_Worst 2.48
51 Lead Quality_Might be 2.45
38 Tags_Current Student 2.44
14 Last Activity_Olark Chat Conversation 2.41
37 CurrentOccupation_Working Professional 2.20
55 Lead Profile_Student of SomeSchool 2.14
20 Specialization_Finance 1.95
52 Lead Quality_Not Sure 1.80
44 Tags_No Response 1.79
In [34]:
# Drop the features with P values that are larger than 0.05 
# Drop the features with VIF scores that are larger than 3
col2 = col1.drop(['Lead Source_Olark Chat', 'Lead Source_Organic Search','Lead Source_Others',
                'Lead Source_Referral Sites', 'Lead Source_Welingak Website',
                'Last Activity_Email Bounced','Last Activity_Form Submitted on Website',
                'Last Activity_Olark Chat Conversation','Last Activity_Unsubscribed',
                'Specialization_Finance','Specialization_Healthcare',
                'Specialization_IT','Specialization_IT','Specialization_IT',
                'Specialization_IT','Specialization_IT','Specialization_IT',
                'Specialization_IT', 'Specialization_Marketing',
                 'Specialization_Operations and Supply Chain',
                 'Specialization_Retail','Specialization_Rural and Agribusiness',
                'HowHear_Email', 'HowHear_Multiple Sources',
                 'Tags_Not doing further education', 'Tags_invalid number or not provided',
                'Lead Quality_Might be','Lead Profile_Lateral Student',
                 'Lead Profile_Student of SomeSchool', 'City_Other Cities',
                 'City_Other Cities of Maharashtra', 'City_Other Metro Cities',
                 'City_Thane & Outskirts', 'City_Tier II Cities'], 1)

4.43 3rd Try

In [35]:
X_train_sm_3 = sm.add_constant(X_train[col2])
lr3 = sm.GLM(y_train, X_train_sm_3, family = sm.families.Binomial())
res = lr3.fit()
res.summary()
Out[35]:
Generalized Linear Model Regression Results
Dep. Variable: Converted No. Observations: 7392
Model: GLM Df Residuals: 7358
Model Family: Binomial Df Model: 33
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -1626.7
Date: Sun, 14 Mar 2021 Deviance: 3253.4
Time: 14:43:55 Pearson chi2: 9.85e+03
No. Iterations: 8
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const -3.4414 0.168 -20.481 0.000 -3.771 -3.112
TotalVisits 0.2670 0.072 3.686 0.000 0.125 0.409
TotalTime 1.0771 0.053 20.207 0.000 0.973 1.182
PPV -0.6014 0.082 -7.347 0.000 -0.762 -0.441
Lead Origin_Landing Page Submission -0.8414 0.125 -6.716 0.000 -1.087 -0.596
Lead Source_Google -0.2865 0.117 -2.440 0.015 -0.517 -0.056
Last Activity_Email Link Clicked 1.4813 0.286 5.171 0.000 0.920 2.043
Last Activity_Email Opened 1.6938 0.153 11.091 0.000 1.394 1.993
Last Activity_Others 2.1195 0.399 5.309 0.000 1.337 2.902
Last Activity_Page Visited on Website 0.8533 0.233 3.663 0.000 0.397 1.310
Last Activity_SMS Sent 3.1402 0.163 19.241 0.000 2.820 3.460
Last Activity_Unreachable 1.3857 0.490 2.826 0.005 0.425 2.347
Specialization_Human Resource 0.1334 0.169 0.788 0.431 -0.198 0.465
Specialization_Tourism and Hospitality -0.5085 0.251 -2.023 0.043 -1.001 -0.016
HowHear_Online Search 0.5151 0.190 2.705 0.007 0.142 0.888
HowHear_SMS 0.5306 0.853 0.622 0.534 -1.141 2.202
HowHear_Social Media 0.5009 0.590 0.850 0.396 -0.655 1.656
HowHear_Word Of Mouth 0.9149 0.193 4.732 0.000 0.536 1.294
CurrentOccupation_Student 2.5303 0.428 5.905 0.000 1.691 3.370
CurrentOccupation_Unemployed 2.6494 0.117 22.641 0.000 2.420 2.879
CurrentOccupation_Working Professional 3.7634 0.275 13.673 0.000 3.224 4.303
Tags_Current Student -3.3313 0.349 -9.539 0.000 -4.016 -2.647
Tags_Diploma holder (Not Eligible) -4.4444 1.082 -4.108 0.000 -6.565 -2.324
Tags_Have Question -1.9825 1.093 -1.814 0.070 -4.124 0.159
Tags_Insterested -3.7486 0.314 -11.942 0.000 -4.364 -3.133
Tags_Interested in full time MBA -4.1018 0.789 -5.200 0.000 -5.648 -2.556
Tags_Lost 3.8084 0.395 9.651 0.000 3.035 4.582
Tags_No Response -2.1384 0.194 -11.000 0.000 -2.519 -1.757
Tags_Ringing -5.0228 0.245 -20.491 0.000 -5.503 -4.542
Tags_Will revert after reading the email 2.4284 0.175 13.870 0.000 2.085 2.772
Tags_in touch with EINS -2.6342 1.199 -2.197 0.028 -4.984 -0.285
Lead Quality_Low in Relevance 0.2026 0.227 0.892 0.372 -0.242 0.648
Lead Quality_Not Sure 0.0587 0.175 0.336 0.737 -0.284 0.401
Lead Quality_Worst -3.3694 0.524 -6.435 0.000 -4.396 -2.343
In [36]:
VIF_rank = pd.DataFrame()
VIF_rank['Features'] = X_train[col2].columns
VIF_rank['VIF'] = [variance_inflation_factor(X_train[col2].values, i) for i in range(X_train[col2].shape[1])]
VIF_rank['VIF'] = round(VIF_rank['VIF'], 2)
VIF_rank = VIF_rank.sort_values(by = "VIF", ascending = False)
VIF_rank
Out[36]:
Features VIF
18 CurrentOccupation_Unemployed 5.63
3 Lead Origin_Landing Page Submission 3.11
2 PPV 2.85
28 Tags_Will revert after reading the email 2.84
0 TotalVisits 2.63
9 Last Activity_SMS Sent 2.29
6 Last Activity_Email Opened 2.16
27 Tags_Ringing 2.07
4 Lead Source_Google 1.90
19 CurrentOccupation_Working Professional 1.85
20 Tags_Current Student 1.81
32 Lead Quality_Worst 1.69
31 Lead Quality_Not Sure 1.57
26 Tags_No Response 1.55
13 HowHear_Online Search 1.42
23 Tags_Insterested 1.39
17 CurrentOccupation_Student 1.34
16 HowHear_Word Of Mouth 1.32
1 TotalTime 1.31
30 Lead Quality_Low in Relevance 1.29
8 Last Activity_Page Visited on Website 1.27
11 Specialization_Human Resource 1.15
24 Tags_Interested in full time MBA 1.10
7 Last Activity_Others 1.09
5 Last Activity_Email Link Clicked 1.08
12 Specialization_Tourism and Hospitality 1.08
25 Tags_Lost 1.08
21 Tags_Diploma holder (Not Eligible) 1.06
10 Last Activity_Unreachable 1.05
15 HowHear_Social Media 1.04
14 HowHear_SMS 1.02
29 Tags_in touch with EINS 1.01
22 Tags_Have Question 1.01
In [37]:
col3 = col2.drop(['CurrentOccupation_Unemployed', 'Lead Origin_Landing Page Submission',
                  'PPV','Last Activity_Unreachable','Specialization_Human Resource',
                  'Specialization_Tourism and Hospitality','HowHear_SMS',
                  'HowHear_Social Media','Lead Quality_Not Sure'],1)

4.44 4th Try

In [38]:
X_train_sm_4 = sm.add_constant(X_train[col3])
lr4 = sm.GLM(y_train, X_train_sm_4, family = sm.families.Binomial())
res = lr4.fit()
res.summary()
Out[38]:
Generalized Linear Model Regression Results
Dep. Variable: Converted No. Observations: 7392
Model: GLM Df Residuals: 7367
Model Family: Binomial Df Model: 24
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -2067.7
Date: Sun, 14 Mar 2021 Deviance: 4135.4
Time: 14:43:58 Pearson chi2: 1.02e+04
No. Iterations: 8
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const -2.4312 0.122 -19.999 0.000 -2.669 -2.193
TotalVisits -0.3447 0.046 -7.416 0.000 -0.436 -0.254
TotalTime 0.8888 0.045 19.748 0.000 0.801 0.977
Lead Source_Google -0.3202 0.100 -3.218 0.001 -0.515 -0.125
Last Activity_Email Link Clicked 1.4124 0.246 5.736 0.000 0.930 1.895
Last Activity_Email Opened 1.6281 0.131 12.415 0.000 1.371 1.885
Last Activity_Others 3.1076 0.348 8.939 0.000 2.426 3.789
Last Activity_Page Visited on Website 1.0652 0.198 5.382 0.000 0.677 1.453
Last Activity_SMS Sent 2.8092 0.137 20.530 0.000 2.541 3.077
HowHear_Online Search -0.0061 0.158 -0.039 0.969 -0.316 0.303
HowHear_Word Of Mouth 0.4352 0.164 2.659 0.008 0.114 0.756
CurrentOccupation_Student 0.1895 0.416 0.455 0.649 -0.627 1.006
CurrentOccupation_Working Professional 1.9097 0.242 7.894 0.000 1.436 2.384
Tags_Current Student -2.0994 0.329 -6.379 0.000 -2.744 -1.454
Tags_Diploma holder (Not Eligible) -3.1600 1.058 -2.986 0.003 -5.234 -1.086
Tags_Have Question -0.7603 1.089 -0.698 0.485 -2.895 1.375
Tags_Insterested -2.5283 0.303 -8.334 0.000 -3.123 -1.934
Tags_Interested in full time MBA -3.1309 0.827 -3.786 0.000 -4.752 -1.510
Tags_Lost 3.8092 0.366 10.416 0.000 3.092 4.526
Tags_No Response -0.7803 0.144 -5.420 0.000 -1.062 -0.498
Tags_Ringing -3.5221 0.221 -15.952 0.000 -3.955 -3.089
Tags_Will revert after reading the email 3.5386 0.155 22.859 0.000 3.235 3.842
Tags_in touch with EINS -0.8414 1.102 -0.764 0.445 -3.001 1.318
Lead Quality_Low in Relevance 0.6728 0.222 3.029 0.002 0.238 1.108
Lead Quality_Worst -2.6189 0.495 -5.288 0.000 -3.590 -1.648
In [39]:
VIF_rank = pd.DataFrame()
VIF_rank['Features'] = X_train[col3].columns
VIF_rank['VIF'] = [variance_inflation_factor(X_train[col3].values, i) for i in range(X_train[col3].shape[1])]
VIF_rank['VIF'] = round(VIF_rank['VIF'], 2)
VIF_rank = VIF_rank.sort_values(by = "VIF", ascending = False)
VIF_rank
Out[39]:
Features VIF
20 Tags_Will revert after reading the email 2.07
2 Lead Source_Google 1.81
7 Last Activity_SMS Sent 1.73
23 Lead Quality_Worst 1.60
4 Last Activity_Email Opened 1.60
12 Tags_Current Student 1.56
8 HowHear_Online Search 1.31
11 CurrentOccupation_Working Professional 1.30
19 Tags_Ringing 1.29
0 TotalVisits 1.27
22 Lead Quality_Low in Relevance 1.27
1 TotalTime 1.27
9 HowHear_Word Of Mouth 1.24
6 Last Activity_Page Visited on Website 1.14
10 CurrentOccupation_Student 1.14
15 Tags_Insterested 1.13
18 Tags_No Response 1.12
16 Tags_Interested in full time MBA 1.04
17 Tags_Lost 1.04
13 Tags_Diploma holder (Not Eligible) 1.04
3 Last Activity_Email Link Clicked 1.04
5 Last Activity_Others 1.03
14 Tags_Have Question 1.00
21 Tags_in touch with EINS 1.00
In [41]:
col4 = col3.drop(['HowHear_Online Search',
                  'HowHear_Word Of Mouth',
                  'CurrentOccupation_Student',
                  'Tags_Diploma holder (Not Eligible)',
                  'Tags_Have Question',
                  'Tags_in touch with EINS',
                  'Last Activity_Others'], 1)

4.45 5th Try - Best Model

In [42]:
X_train_sm_5 = sm.add_constant(X_train[col4])
lr5 = sm.GLM(y_train, X_train_sm_5, family = sm.families.Binomial())
res = lr5.fit()
res.summary()
Out[42]:
Generalized Linear Model Regression Results
Dep. Variable: Converted No. Observations: 7392
Model: GLM Df Residuals: 7374
Model Family: Binomial Df Model: 17
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -2132.1
Date: Sun, 14 Mar 2021 Deviance: 4264.2
Time: 14:44:47 Pearson chi2: 8.41e+03
No. Iterations: 8
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const -2.0016 0.102 -19.642 0.000 -2.201 -1.802
TotalVisits -0.3067 0.045 -6.789 0.000 -0.395 -0.218
TotalTime 0.9234 0.044 20.934 0.000 0.837 1.010
Lead Source_Google -0.3447 0.090 -3.843 0.000 -0.520 -0.169
Last Activity_Email Link Clicked 0.9744 0.238 4.099 0.000 0.508 1.440
Last Activity_Email Opened 1.2098 0.115 10.495 0.000 0.984 1.436
Last Activity_Page Visited on Website 0.6375 0.187 3.405 0.001 0.271 1.004
Last Activity_SMS Sent 2.4074 0.122 19.804 0.000 2.169 2.646
CurrentOccupation_Working Professional 1.9620 0.234 8.372 0.000 1.503 2.421
Tags_Current Student -2.1264 0.329 -6.471 0.000 -2.771 -1.482
Tags_Insterested -2.5150 0.300 -8.371 0.000 -3.104 -1.926
Tags_Interested in full time MBA -3.2267 0.832 -3.876 0.000 -4.858 -1.595
Tags_Lost 3.7604 0.366 10.287 0.000 3.044 4.477
Tags_No Response -0.7765 0.144 -5.398 0.000 -1.058 -0.495
Tags_Ringing -3.4991 0.221 -15.827 0.000 -3.932 -3.066
Tags_Will revert after reading the email 3.5329 0.151 23.368 0.000 3.237 3.829
Lead Quality_Low in Relevance 0.8341 0.216 3.868 0.000 0.411 1.257
Lead Quality_Worst -2.9105 0.481 -6.050 0.000 -3.853 -1.968
In [43]:
VIF_rank = pd.DataFrame()
VIF_rank['Features'] = X_train[col4].columns
VIF_rank['VIF'] = [variance_inflation_factor(X_train[col4].values, i) for i in range(X_train[col4].shape[1])]
VIF_rank['VIF'] = round(VIF_rank['VIF'], 2)
VIF_rank = VIF_rank.sort_values(by = "VIF", ascending = False)
VIF_rank
Out[43]:
Features VIF
14 Tags_Will revert after reading the email 2.03
6 Last Activity_SMS Sent 1.71
4 Last Activity_Email Opened 1.57
16 Lead Quality_Worst 1.53
8 Tags_Current Student 1.51
2 Lead Source_Google 1.47
7 CurrentOccupation_Working Professional 1.28
13 Tags_Ringing 1.28
1 TotalTime 1.26
15 Lead Quality_Low in Relevance 1.25
0 TotalVisits 1.24
5 Last Activity_Page Visited on Website 1.13
9 Tags_Insterested 1.12
12 Tags_No Response 1.12
3 Last Activity_Email Link Clicked 1.04
10 Tags_Interested in full time MBA 1.04
11 Tags_Lost 1.04

Performance Evaluation

In [60]:
X_test_sm_5 = sm.add_constant(X_test[col4])
y_pred_LR_probability = res.predict(X_test_sm_5)
In [61]:
y_pred_LR_probability
Out[61]:
1934    0.483288
4442    0.010559
494     0.956996
1320    0.152994
2236    0.921193
          ...   
2077    0.025997
5873    0.483288
7998    0.005069
4192    0.997729
1337    0.074426
Length: 1848, dtype: float64
In [62]:
y_pred_LR_probability = y_pred_LR_probability.values.reshape(-1)
y_pred_LR_probability[:10]
Out[62]:
array([0.48328798, 0.01055936, 0.95699627, 0.15299431, 0.92119349,
       0.88998633, 0.00067673, 0.99583497, 0.51382854, 0.96970545])
In [177]:
import numpy as np 
np.set_printoptions(suppress=True); 
In [63]:
y_pred_LR = (y_pred_LR_probability > 0.5).astype(int)
In [64]:
y_pred_LR
Out[64]:
array([0, 0, 1, ..., 0, 1, 0])
In [72]:
flat_accuracy(y_pred_LR, y_test)
flat_precision(y_pred_LR, y_test)
flat_recall(y_pred_LR, y_test)
flat_f1(y_pred_LR, y_test)
Accuracy = 87.8788%
Precision = 88.2442%
Recall = 85.9996%
F1 = 86.8539%
In [66]:
plot_cm(y_pred_LR, y_test, 'Logistic Regression')

4.46 Model Implications

In [76]:
# Features Coefficients 
pd.options.display.float_format = '{:.2f}'.format
fit_parameters = res.params[1:]
fit_parameters
Out[76]:
TotalVisits                                -0.31
TotalTime                                   0.92
Lead Source_Google                         -0.34
Last Activity_Email Link Clicked            0.97
Last Activity_Email Opened                  1.21
Last Activity_Page Visited on Website       0.64
Last Activity_SMS Sent                      2.41
CurrentOccupation_Working Professional      1.96
Tags_Current Student                       -2.13
Tags_Insterested                           -2.51
Tags_Interested  in full time MBA          -3.23
Tags_Lost                                   3.76
Tags_No Response                           -0.78
Tags_Ringing                               -3.50
Tags_Will revert after reading the email    3.53
Lead Quality_Low in Relevance               0.83
Lead Quality_Worst                         -2.91
dtype: float64

Logistic Regression Model Equation

logit(p) = log(p/(1-p))= β0 + β1* X1 + … + βn * Xn
Put the coefficient numbers from the best model to the logistic regression, and then we get the following equation.

logit(p) = +3.76*Tags_Lost
           +3.53*Tags_Will revert after reading the email
           +2.41*Last Activity_SMS Sent
           +1.96*CurrentOccupation_Working Professional
           +1.21*Last Activity_Email Opened
           +0.97*Last Activity_Email Link Clicked
           +0.92*TotalTime
           +0.83*Lead Quality_Low in Relevance
           +0.64*Last Activity_Page Visited on Website
           -0.31*TotalVisits
           -0.34*Lead Source_Google
           -0.78*Tags_No Response
           -2.13*Tags_Current Student
           -2.51*Tags_Insterested
           -2.91*Lead Quality_Worst
           -3.23*Tags_Interested in full time MBA
           -3.50*Tags_Ringing
           -19.64

Findings:

The regression coefficients show the change in log(odds) in Converted for a unit change in the predictor variable, holding all other predictor variables constant.

In this sense, the above euqation can be interpreted in this way:

  • A 1-unit increase in the number of Total Time is associated with an increase in the odds of conversion by a factor of 2.5=exp(0.92), about an 150% increase, hold everything else constant.
  • The odds of CurrentOccupation_Working Professional having converted is about exp(1.69)=7.1, 610% higher than the odds of non-working professional doing so, hold everything else constant.
  • And etc.
  • Roughly speaking, for occupation, working professional is positively associated with conversion.
  • For last activity, leads who sent SMS, opened/clicked Email, or visited website pages are positively related with conversion.
  • Tags with education info are all negatively related with conversion. It implies that Diploma-related program didn’t sell well.
  • Lead quality label correctly reflects conversion.
  • Leads tend to convert when they spend more time on website while total visits are not necessarily the case.

5.0 Lead Scoring System

image.png

After comparing these four models, I find that CatBoost wins so I will use CatBoost model to predict conversion probability and then build lead scoring system.

In [129]:
from catboost import CatBoostClassifier;
classifier_CatBoost = CatBoostClassifier(logging_level = 'Silent') 
classifier_CatBoost.fit(X_train, y_train)
y_pred_CatBoost = classifier_CatBoost.predict(X_test)
In [130]:
y_pred_CatBoost
Out[130]:
array([0, 0, 1, ..., 0, 1, 0])
In [131]:
# Get predicted probability 
y_pred_CatBoost_pro = classifier_CatBoost.predict_proba(X_test)
y_pred_CatBoost_pro =[i[1] for i in y_pred_CatBoost_pro]
In [230]:
# Create a dataframe with the actual conversion, Predicted Probabilities and Predicted Conversion
Lead_Scoring = pd.DataFrame({'Conversion':y_test.values, 'Conversion Probability':y_pred_CatBoost_pro})
Lead_Scoring['Prediction Class'] = y_pred_CatBoost
Lead_Scoring.head(10)
Out[230]:
Conversion Conversion Probability Prediction Class
0 0 0.47 0
1 1 0.00 0
2 1 0.99 1
3 1 0.01 0
4 0 0.98 1
5 0 0.97 1
6 0 0.00 0
7 1 0.99 1
8 0 0.06 0
9 1 0.99 1

Conversion Probability multiplies by 100 and then get the lead score.

In [134]:
Lead_Scoring['Lead Score'] = Lead_Scoring['Conversion Probability'].map( lambda x: round(x*100))
Lead_Scoring.head()
Out[134]:
Converted Conversion Probability Prediction Class Lead Score
0 1 0.47 0 47
1 0 0.00 0 0
2 1 0.99 1 99
3 0 0.01 0 1
4 1 0.98 1 98
In [135]:
# Define a function for putting lead score into four buckets
def lead_bucket(x):
    if x<25:
        return "Cold"
    elif x>=25 and x<50:
        return "Cool"
    elif x>=51 and x<76:
        return "Warm"
    else:
        return "Hot"
    

Lead_Scoring['Lead Buckets'] = Lead_Scoring['Lead Score'].apply(lead_bucket)

Change the order of the Columns so that first column is the actual conversion, the second one is the predicted conversion for better observation.

In [140]:
Lead_Scoring = Lead_Scoring[['Converted', 'Prediction Class', 'Conversion Probability', 
                             'Lead Score', 'Lead Buckets']]
In [142]:
Lead_Scoring.tail(20)
Out[142]:
Converted Prediction Class Conversion Probability Lead Score Lead Buckets
1828 0 0 0.04 4 Cold
1829 0 0 0.01 1 Cold
1830 0 0 0.01 1 Cold
1831 0 0 0.00 0 Cold
1832 1 1 0.94 94 Hot
1833 0 0 0.18 18 Cold
1834 0 0 0.01 1 Cold
1835 0 0 0.07 7 Cold
1836 0 0 0.01 1 Cold
1837 1 1 0.72 72 Warm
1838 1 1 0.98 98 Hot
1839 0 0 0.06 6 Cold
1840 1 1 1.00 100 Hot
1841 0 0 0.01 1 Cold
1842 1 0 0.32 32 Cool
1843 0 0 0.07 7 Cold
1844 1 1 0.99 99 Hot
1845 0 0 0.00 0 Cold
1846 1 1 0.97 97 Hot
1847 0 0 0.31 31 Cool

Thanks!